Audio Analyzer — Phase 1

A practical audio intelligence system for generating accurate, human-readable descriptions of audio clips — suitable for building searchable audio catalogs.

What This Does

This system takes audio files as input and produces:

A concise one-sentence description of what is heard
A detailed paragraph covering temporal structure and acoustic character
A structured tag list (controlled vocabulary)
An ordered list of sound events (temporal breakdown)
A confidence score for the overall description

Everything is designed for cataloging accuracy — no emotion analysis, no cinematic interpretation, no hallucinated context.

Example output:

{
  "file_name": "metal_impact_01.wav",
  "short_description": "A sharp metallic impact followed by a brief echo.",
  "detailed_description": "The clip contains a metallic impact. The temporal sequence is: metallic impact → short echo tail. The sound has a sharp, transient attack.",
  "tags": ["impact", "metallic impact", "percussive", "sharp transient", "reverb"],
  "sound_events": ["metallic impact", "short echo tail"],
  "confidence": 0.84
}
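
As a rough illustration of the record shape above, the sketch below parses the compact JSON and applies two basic checks. It uses a standard-library dataclass to stay dependency-free; the project's actual model is the Pydantic AudioAnalysisRecord in src/timbre/output/schema.py, which is not reproduced here. The class name AnalysisRecordSketch, the parse_record helper, and the assumption that confidence lies in [0, 1] are illustrative only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisRecordSketch:
    """Hypothetical stand-in for the compact per-file record."""

    file_name: str
    short_description: str
    detailed_description: str
    tags: list[str]
    sound_events: list[str]
    confidence: float

    def __post_init__(self) -> None:
        # Assumption: confidence is a probability-like score in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
        if not self.file_name:
            raise ValueError("file_name must not be empty")


def parse_record(raw: str) -> AnalysisRecordSketch:
    """Parse one compact JSON record; unknown keys raise TypeError."""
    return AnalysisRecordSketch(**json.loads(raw))


EXAMPLE = """
{
  "file_name": "metal_impact_01.wav",
  "short_description": "A sharp metallic impact followed by a brief echo.",
  "detailed_description": "The clip contains a metallic impact.",
  "tags": ["impact", "metallic impact", "percussive", "sharp transient", "reverb"],
  "sound_events": ["metallic impact", "short echo tail"],
  "confidence": 0.84
}
"""

record = parse_record(EXAMPLE)
print(record.file_name, record.confidence)
```

Because the constructor is strict about field names, a record saved with --full (which carries extra metadata and acoustics fields) would be rejected by this sketch rather than silently truncated.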

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Audio Analyzer — Phase 1             │
│                                                         │
│   Audio File (.wav / .mp3 / .flac / .ogg / .m4a)       │
│         │                                               │
│         ▼                                               │
│   ┌──────────────┐                                      │
│   │ AudioLoader  │  librosa load + normalize to 48kHz   │
│   └──────┬───────┘                                      │
│          │                                              │
│          ├──────────────────────────┐                   │
│          ▼                          ▼                   │
│   ┌──────────────┐         ┌─────────────────┐         │
│   │FeatureExtract│         │   CLAPTagger    │         │
│   │   (librosa)  │         │(laion/larger_   │         │
│   │              │         │ clap_general)   │         │
│   │ - RMS energy │         │                 │         │
│   │ - Spectral   │         │ Full-clip       │         │
│   │ - Transients │         │ zero-shot       │         │
│   │ - Band energy│         │ classification  │         │
│   │ - Silence    │         │                 │         │
│   └──────┬───────┘         │ Sliding window  │         │
│          │                 │ event detection │         │
│          │                 └────────┬────────┘         │
│          │                          │                   │
│          └──────────┬───────────────┘                   │
│                     ▼                                   │
│             ┌──────────────┐                            │
│             │ Description  │  Template engine:          │
│             │ Synthesizer  │  tags + events + features  │
│             │              │  → natural language        │
│             └──────┬───────┘                            │
│                    │                                    │
│                    ▼                                    │
│             ┌──────────────┐                            │
│             │  Serializer  │  JSON / Markdown / CSV     │
│             └──────┬───────┘                            │
│                    ▼                                    │
│             AudioAnalysisRecord                         │
│             (Pydantic validated)                        │
└─────────────────────────────────────────────────────────┘
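
The Description Synthesizer stage can be pictured with a minimal template sketch like the one below. It is an illustration under stated assumptions, not the code in description_synthesizer.py: the function names, the sharp_attack flag, and the sentence templates are hypothetical, chosen so that the detailed description reproduces the wording of the example output above. The short description it builds is simpler than the one in that example.

```python
def article(phrase: str) -> str:
    """Pick 'a' or 'an' from the first letter (a crude heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"


def synthesize(events: list[str], sharp_attack: bool) -> dict[str, str]:
    """Fill fixed sentence templates from ordered events and one feature flag."""
    if not events:
        raise ValueError("at least one sound event is required")

    primary, rest = events[0], events[1:]

    # Short description: primary event, then any following events in order.
    short = f"{article(primary).capitalize()} {primary}"
    if rest:
        short += " followed by " + ", then ".join(f"{article(e)} {e}" for e in rest)
    short += "."

    # Detailed description: one templated sentence per available fact.
    sentences = [f"The clip contains {article(primary)} {primary}."]
    if rest:
        sentences.append("The temporal sequence is: " + " → ".join(events) + ".")
    if sharp_attack:
        sentences.append("The sound has a sharp, transient attack.")

    return {
        "short_description": short,
        "detailed_description": " ".join(sentences),
    }


result = synthesize(["metallic impact", "short echo tail"], sharp_attack=True)
print(result["short_description"])
print(result["detailed_description"])
```

Because every sentence comes from a fixed template and a fixed label, the same inputs always produce the same text, which is the property the comparison below relies on.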

Why CLAP + Template Engine (not an Audio LLM)?

| Property | CLAP + Templates | Audio LLM (e.g. Qwen-Audio) |
|---|---|---|
| Hallucination risk | None (labels are fixed) | Present |
| Consistency | Deterministic per run | Variable |
| Speed | Fast (< 2s/clip on GPU) | Slow (5–20s/clip) |
| GPU memory | ~4 GB | 14–40 GB |
| Catalog vocabulary | Controlled | Free-form |
| Phase 1 suitability | ✓ Ideal | Phase 2 enhancement |

Project Structure

audio_analyzer/
├── timbre.py                   # Root CLI entrypoint
├── analyze.py                  # Compatibility wrapper for single-file CLI
├── batch_process.py            # Compatibility wrapper for batch CLI
├── pyproject.toml              # Poetry dependency metadata
├── requirements.txt
├── setup_mac.sh                # macOS Silicon setup (M1/M2/M3/M4)
├── setup_runpod.sh             # RunPod GPU environment setup
│
├── config/
│   ├── config.yaml             # Model, analysis, output settings
│   └── vocabulary.yaml         # Controlled vocabulary (13 categories, ~194 labels)
│
├── src/
│   ├── cli/
│   │   ├── main.py             # Top-level Click CLI with subcommands
│   │   ├── analyze.py          # Single-file analysis command
│   │   ├── batch.py            # Batch analysis command
│   │   └── cache.py            # Label-cache builder command
│   └── timbre/
│       ├── config_loader.py    # YAML config loader + logging setup
│       ├── pipeline.py         # Main orchestrator (AudioAnalysisPipeline)
│       ├── ingestion/
│       │   └── audio_loader.py # Load + validate + normalize audio files
│       ├── models/
│       │   └── clap_tagger.py  # CLAP zero-shot classification wrapper
│       ├── analysis/
│       │   ├── feature_extractor.py        # Acoustic features (librosa)
│       │   ├── event_detector.py           # Sliding-window event detection
│       │   └── description_synthesizer.py  # Natural language description builder
│       └── output/
│           ├── schema.py       # Pydantic AudioAnalysisRecord model
│           ├── serializer.py   # JSON / Markdown / CSV per-file output
│           └── catalog_builder.py  # Multi-file catalog aggregation
│
└── outputs/                    # Example output location (config default: ./out/<profile>/)
    ├── json/                   # Per-file JSON
    ├── markdown/               # Per-file Markdown review reports
    ├── catalog.md              # Full catalog grouped by category
    ├── catalog.csv             # Flat CSV catalog
    └── batch_results.json      # All records in one JSON array

Running Locally — macOS Silicon (M1/M2/M3/M4)

Requirements

macOS 12.3 (Monterey) or later — required for MPS support
Apple Silicon Mac (M1 or newer)
Homebrew
Python 3.10+

Device behaviour

On Apple Silicon, PyTorch uses MPS (Metal Performance Shaders), the GPU backend for Apple's unified memory architecture. The system selects a device automatically, in this order of priority:

CUDA (NVIDIA) → MPS (Apple Silicon) → CPU

fp16 is automatically disabled on MPS, so CLAP runs in fp32, which is correct and stable. Expect inference to be roughly 3–8x slower than on a dedicated NVIDIA GPU, but much faster than CPU-only.
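
The priority order and the fp16 rule can be sketched as follows. This is a minimal illustration, not the project's device-selection code: the function names are hypothetical, and the only PyTorch calls used are torch.cuda.is_available() and torch.backends.mps.is_available(). The fp16 rule (CUDA only) is taken from the Technical Notes table in this README.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the documented priority: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def effective_fp16(device: str, fp16_requested: bool) -> bool:
    """fp16 is only honoured on CUDA; MPS and CPU stay in fp32."""
    return fp16_requested and device == "cuda"


def detect_device() -> str:
    """Query PyTorch if it is installed; fall back to CPU otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return select_device(torch.cuda.is_available(), mps_ok)


device = detect_device()
print(device, effective_fp16(device, fp16_requested=True))
```

Keeping the priority logic in a pure function makes it easy to check without any particular hardware present.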

Setup

cd audio_analyzer
bash setup_mac.sh

This creates a .venv in the project root, installs ffmpeg via Homebrew, installs PyTorch with MPS support and the remaining Python dependencies, and pre-downloads the CLAP model (~1.2 GB).

Optional: Poetry workflow

pyproject.toml is the dependency source of truth. The checked-in requirements.txt is generated from Poetry during release with poetry export --format requirements.txt --without-hashes --only main.

PyTorch still has to be installed per platform before running poetry install:

macOS: install torch separately before poetry install
RunPod/CUDA: install torch, torchaudio, and torchvision together with the matching CUDA wheel index before poetry install

Example:

cd audio_analyzer
poetry install
poetry run timbre analyze samples/0_sample.wav

The Makefile also supports Poetry directly:

make install
make run USE_POETRY=1 FILE=samples/0_sample.wav
make batch USE_POETRY=1 DIR=./samples

Poetry console scripts are also defined:

poetry run timbre analyze samples/0_sample.wav
poetry run timbre batch ./samples
poetry run timbre vocab cache --force

Run

Activate the virtual environment, then run:

source .venv/bin/activate
python timbre.py analyze samples/0_sample.wav
python timbre.py batch ./samples/

Or skip activation and use the venv Python directly:

.venv/bin/python timbre.py analyze samples/0_sample.wav

To inspect the configured profiles:

python timbre.py analyze --list-profiles
python timbre.py batch --list-profiles
python timbre.py profile list
python timbre.py profile inspect precise

To run with a specific profile:

python timbre.py analyze samples/0_sample.wav --profile precise
python timbre.py batch ./samples --profile fast

To run several profiles in one command:

python timbre.py analyze samples/0_sample.wav \
  --profile balanced \
  --profile precise \
  --profile conservative

python timbre.py batch ./samples \
  --profile fast \
  --profile precise

To sweep every named profile in the config:

python timbre.py analyze samples/0_sample.wav --all-profiles
python timbre.py batch ./samples --all-profiles

Outputs are scoped automatically by profile name. For example, with --profile precise and the default config, artifacts are written under ./out/precise/.

To confirm MPS is active, look for this line in the output:

[INFO] timbre.models.clap_tagger: Loading CLAP model: laion/larger_clap_general on mps

Running With Docker

The project now supports a simple CPU-only Linux Docker image for distribution. The container exposes the existing timbre CLI directly, uses the bundled config/config.yaml and config/vocabulary.yaml by default, and downloads the CLAP model from Hugging Face on first run.

Build the image

docker build -t timbre .

Export images to share with friends

If you want to send the image directly instead of publishing it to a registry, export one tarball per CPU architecture:

make docker-export-arm64
make docker-export-amd64

Or build both at once:

make docker-export-all

This produces:

dist/timbre-arm64.tar for Apple Silicon users
dist/timbre-amd64.tar for Intel/AMD users

The easiest way to share it with non-technical users is to send:

the right tar file for their machine
timbre-docker.sh

They can then load and run the image with simple commands:

bash timbre-docker.sh load
bash timbre-docker.sh analyze /path/to/example.wav
bash timbre-docker.sh batch /path/to/folder

If they prefer using Docker directly, they can still import the right file with:

docker load -i timbre-arm64.tar

or:

docker load -i timbre-amd64.tar

Analyze one file

Use the wrapper script:

bash timbre-docker.sh analyze /path/to/example.wav

The equivalent raw Docker command mounts the input audio read-only and an output directory read-write:

docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  timbre analyze /data/in/example.wav --output-dir /data/out

Batch analyze a directory

bash timbre-docker.sh batch /path/to/folder

The equivalent raw Docker command is:

docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  timbre batch /data/in --output-dir /data/out

Reuse the Hugging Face cache

To avoid downloading the CLAP model on every fresh container run, mount a persistent cache directory:

docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  -v "$PWD/.hf-cache:/root/.cache/huggingface" \
  timbre analyze /data/in/example.wav --output-dir /data/out

Use a custom config or vocabulary

Mount your custom files and pass them through the existing CLI options:

docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  -v "$PWD/config:/data/config:ro" \
  timbre analyze /data/in/example.wav \
    --output-dir /data/out \
    --config /data/config/config.yaml \
    --profile precise \
    --vocab /data/config/vocabulary.yaml

Notes

This first Docker workflow is CPU-only; no CUDA or GPU container support is included yet.
The first run may take longer because model weights are downloaded at runtime.
The recommended contract is to mount inputs read-only, outputs writable, and optionally persist /root/.cache/huggingface.
Manual sharing is the simplest path right now: build locally, send the correct dist/*.tar file plus timbre-docker.sh, and let friends use the wrapper script instead of writing Docker commands by hand.
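
For anyone scripting the container from Python, the mount contract above can be expressed as a small command builder. This is a hedged sketch that mirrors the raw docker run commands shown in this section; it is not the contents of timbre-docker.sh, and the function name and parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def docker_analyze_command(
    audio_file: str,
    output_dir: str,
    hf_cache_dir: Optional[str] = None,
    image: str = "timbre",
) -> list[str]:
    """Build the argv for a single-file analysis run in the container."""
    audio = Path(audio_file).absolute()
    out = Path(output_dir).absolute()

    cmd = [
        "docker", "run", "--rm",
        # Inputs are mounted read-only, outputs writable.
        "-v", f"{audio.parent}:/data/in:ro",
        "-v", f"{out}:/data/out",
    ]
    if hf_cache_dir is not None:
        # Optional: persist the Hugging Face cache between runs.
        cache = Path(hf_cache_dir).absolute()
        cmd += ["-v", f"{cache}:/root/.cache/huggingface"]

    cmd += [image, "analyze", f"/data/in/{audio.name}", "--output-dir", "/data/out"]
    return cmd


command = docker_analyze_command("/tmp/clips/example.wav", "/tmp/out")
print(" ".join(command))
```

The returned list can be passed to subprocess.run(command, check=True); building it as a list rather than a shell string avoids quoting problems with paths that contain spaces.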

Deploying to RunPod

Requirements

RunPod pod with at least one GPU (A10G, RTX 3090, A100, etc.)
Recommended template: RunPod PyTorch 2.4 (CUDA 12.4, Ubuntu 22.04)
Your local machine needs: ssh, scp (both standard on macOS/Linux)

Note: torch >= 2.6.0 is required due to CVE-2025-32434. The setup script handles this upgrade automatically, including upgrading torchvision and torchaudio together to avoid version conflicts.

Step 1 — Start a pod and get SSH details

In the RunPod UI, start a pod and click Connect → SSH. You'll get connection details that look like:

ssh root@194.68.245.147 -p 22017 -i ~/.ssh/id_ed25519

Step 2 — Upload the project with scp

From your local machine, in the directory containing audio_analyzer/:

scp -P 22017 -i ~/.ssh/id_ed25519 -r ./audio_analyzer root@194.68.245.147:~/

The -P flag (uppercase) sets the port for scp — note this differs from ssh which uses lowercase -p.

To also upload audio samples:

scp -P 22017 -i ~/.ssh/id_ed25519 -r ./my_samples root@194.68.245.147:~/audio_analyzer/samples/

Step 3 — SSH into the pod and run setup

ssh root@194.68.245.147 -p 22017 -i ~/.ssh/id_ed25519
cd ~/audio_analyzer
bash setup_runpod.sh

The setup script will:

Auto-detect your CUDA version and select the right PyTorch wheel
Create a .venv virtual environment in the project root
Install torch + torchaudio + torchvision together into the venv
Install ffmpeg and libsndfile1
Install all remaining Python dependencies from the release-generated requirements.txt
Pre-download and cache the CLAP model (~1.2 GB)
Verify everything works

Step 4 — Run the analyzer

Activate the virtual environment first:

source .venv/bin/activate

Single file:

python timbre.py analyze samples/0_sample.wav

Single file with all outputs:

python timbre.py analyze samples/0_sample.wav --output-dir ./outputs --markdown --full

Batch — entire folder:

python timbre.py batch ./samples/ --output-dir ./outputs

Re-syncing after local edits

If you update the code locally and want to push changes to the pod, use scp again. scp overwrites existing files on the pod, so it is safe to re-run, but it does not delete files that you removed locally:

# Re-upload only the src/ folder (faster than uploading everything)
scp -P 22017 -i ~/.ssh/id_ed25519 -r ./audio_analyzer/src root@194.68.245.147:~/audio_analyzer/

# Or re-upload specific files
scp -P 22017 -i ~/.ssh/id_ed25519 \
  ./audio_analyzer/src/timbre/models/clap_tagger.py \
  root@194.68.245.147:~/audio_analyzer/src/timbre/models/

Retrieving outputs from the pod

Copy the outputs folder back to your local machine:

scp -P 22017 -i ~/.ssh/id_ed25519 -r \
  root@194.68.245.147:~/audio_analyzer/outputs \
  ./outputs_from_pod

Or just the catalog files:

scp -P 22017 -i ~/.ssh/id_ed25519 \
  root@194.68.245.147:~/audio_analyzer/outputs/catalog.md \
  root@194.68.245.147:~/audio_analyzer/outputs/catalog.csv \
  root@194.68.245.147:~/audio_analyzer/outputs/batch_results.json \
  ./outputs_from_pod/

Tip — save the pod's SSH config locally

Add an entry to ~/.ssh/config so you don't have to type the full connection string every time:

Host runpod-audio
    HostName 194.68.245.147
    Port 22017
    User root
    IdentityFile ~/.ssh/id_ed25519

Then you can use shorthand for everything:

ssh runpod-audio
scp -r ./audio_analyzer runpod-audio:~/
scp -r runpod-audio:~/audio_analyzer/outputs ./outputs_from_pod

Usage

Single file

python timbre.py analyze path/to/file.wav

Options:

| Flag | Description |
|---|---|
| `--output-dir` / `-o` | Directory to save output files |
| `--profile` | Named profile from `config.yaml` (repeatable) |
| `--all-profiles` | Run every named profile from `config.yaml` |
| `--list-profiles` | Print configured profiles and exit |
| `--markdown` | Also save a per-file Markdown review report |
| `--full` | Save full JSON (includes metadata + acoustics) |
| `--no-windowed` | Disable sliding-window event detection (faster) |
| `--quiet` / `-q` | Suppress console output |

Batch folder

python timbre.py batch ./samples/

Options:

| Flag | Description |
|---|---|
| `--output-dir` / `-o` | Root output directory |
| `--profile` | Named profile from `config.yaml` (repeatable) |
| `--all-profiles` | Run every named profile from `config.yaml` |
| `--list-profiles` | Print configured profiles and exit |
| `--catalog` | Generate `catalog.md` (default: on) |
| `--csv` | Generate `catalog.csv` (default: on) |
| `--markdown` | Save per-file Markdown reports |
| `--full` | Full JSON output per file |
| `--limit N` | Only process first N files (useful for testing) |
| `--no-windowed` | Disable sliding-window event detection |

Profiles

Profiles let you A/B CLAP inference settings without editing code. The runtime config is selected from config/config.yaml, merged with the base settings, and stamped into every output record as provenance.

Common workflow:

# See the available profiles
python timbre.py analyze --list-profiles
python timbre.py profile list

# Run the same file with two profiles
python timbre.py analyze samples/0_sample.wav --profile balanced
python timbre.py analyze samples/0_sample.wav --profile precise

# Run several profiles in one pass
python timbre.py analyze samples/0_sample.wav \
  --profile balanced \
  --profile precise \
  --profile sensitive

# Batch compare two profiles across a folder
python timbre.py batch ./samples --profile fast
python timbre.py batch ./samples --profile precise

# Sweep every configured profile
python timbre.py batch ./samples --all-profiles

# Inspect one profile in detail
python timbre.py profile inspect precise
python timbre.py profile inspect precise --json

With the default output settings, this produces separate directories such as:

out/
  balanced/
  fast/
  precise/

Each JSON, Markdown, CSV, and catalog entry includes:

analysis_provenance.profile_name
analysis_provenance.profile_fingerprint
analysis_provenance.config_path

CLI overrides still win over the profile. For example, --profile precise --no-windowed disables windowed analysis for that run and produces a different profile fingerprint in the output provenance.
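
The merge order described above (base settings, then profile overrides, then CLI overrides) and its effect on the fingerprint can be sketched as follows. This is an illustration only: the deep_merge and fingerprint helpers are hypothetical, and hashing canonical JSON with SHA-256 and keeping 12 hex characters is an assumption, since the real fingerprint format is not documented here. The setting values are copied from the config.yaml example in this README.

```python
import copy
import hashlib
import json


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested keys in override replace those in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def fingerprint(settings: dict) -> str:
    """Hash a canonical JSON dump so equal settings give equal fingerprints."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


base = {
    "analysis": {
        "use_windowed_analysis": True,
        "window_seconds": 2.0,
        "hop_seconds": 0.5,
        "min_confidence": 0.25,
        "top_k_categories": 5,
    }
}
precise = {"analysis": {"hop_seconds": 0.25, "min_confidence": 0.20, "top_k_categories": 7}}
cli_no_windowed = {"analysis": {"use_windowed_analysis": False}}

effective = deep_merge(base, precise)              # profile applied to base
with_cli = deep_merge(effective, cli_no_windowed)  # CLI flag wins last

print(fingerprint(effective), fingerprint(with_cli))
```

Hashing the effective settings rather than the profile name is what makes --profile precise and --profile precise --no-windowed distinguishable in the provenance.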

When multiple requested profiles share the same model and label cache, the CLI reuses the already-loaded CLAP resources between runs so you do not pay the model load cost repeatedly.

The dedicated profile inspection command is useful when you want to review the human-friendly label, description, effective settings, and raw YAML overrides for a profile without opening config.yaml manually:

python timbre.py profile list
python timbre.py profile inspect balanced
python timbre.py profile inspect compact_model --json

Output Formats

JSON (per file — spec format)

{
  "file_name": "footsteps_gravel.wav",
  "short_description": "Footsteps on gravel with a consistent rhythmic pace.",
  "detailed_description": "The clip contains footsteps on gravel. Secondary sounds include outdoor ambience. The sound has noticeable transient elements.",
  "tags": ["movement", "footsteps on gravel", "outdoor ambience", "percussive", "rhythmic"],
  "sound_events": ["footsteps on gravel", "outdoor ambience"],
  "confidence": 0.78
}

Markdown Catalog (excerpt)

## Impact

### `metal_hit_01.wav`

**A sharp metallic impact followed by a brief reverberant tail.**

| | |
|---|---|
| Duration   | 1.23s          |
| Label      | metallic impact |
| Confidence | ████░ 0.84     |
| Events     | metallic impact → short echo tail |
| Tags       | `impact`, `metallic impact`, `percussive`, `sharp transient` |
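
The confidence cell in the excerpt above pairs a five-cell bar with the numeric score. One way to render such a bar is sketched below; the function name, the default of five cells, and the use of rounding to choose the number of filled cells are assumptions that happen to reproduce the example row, not a description of the serializer.

```python
def confidence_bar(confidence: float, cells: int = 5) -> str:
    """Render a score in [0, 1] as filled/empty blocks plus the number."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    filled = round(confidence * cells)  # assumption: nearest whole cell
    return "█" * filled + "░" * (cells - filled) + f" {confidence:.2f}"


print(confidence_bar(0.84))
```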

CSV Catalog

file_name,duration_seconds,primary_category,primary_label,confidence,short_description,...
metal_hit_01.wav,1.23,impact,metallic impact,0.84,A sharp metallic impact...,...
footsteps_01.wav,4.50,movement,footsteps on gravel,0.78,Footsteps on gravel...,...
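
A flat catalog like this is easy to post-process with the standard library. The sketch below reads a catalog and lists files whose confidence falls below a review threshold. It uses only the column names visible in the header above (the elided columns are left out), the sample rows keep the truncated descriptions as shown, and the helper names and the 0.80 threshold are hypothetical.

```python
import csv
import io

# Inline sample limited to the columns named in the header above.
CATALOG_CSV = """\
file_name,duration_seconds,primary_category,primary_label,confidence,short_description
metal_hit_01.wav,1.23,impact,metallic impact,0.84,A sharp metallic impact...
footsteps_01.wav,4.50,movement,footsteps on gravel,0.78,Footsteps on gravel...
"""


def load_catalog(text: str) -> list[dict[str, str]]:
    """Parse catalog CSV text into one dict per file."""
    return list(csv.DictReader(io.StringIO(text)))


def needs_review(rows: list[dict[str, str]], threshold: float) -> list[str]:
    """Return file names whose confidence is below the threshold."""
    return [row["file_name"] for row in rows if float(row["confidence"]) < threshold]


rows = load_catalog(CATALOG_CSV)
print(needs_review(rows, threshold=0.80))
```

For a real catalog, replace the inline string with the contents of catalog.csv, for example by passing open(path, newline="") to csv.DictReader directly.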

Configuration

`config/config.yaml`

default_profile: "balanced"

base:
  model:
    model_id: "laion/larger_clap_general"
    device: null
    fp16: true
    vocab_file: "vocabulary.yaml"
    label_cache_path: ".cache/label_cache.pt"

  analysis:
    use_windowed_analysis: true
    window_seconds: 2.0
    hop_seconds: 0.5
    min_confidence: 0.25
    top_k_categories: 5

  output:
    output_dir: "./out"
    save_per_file_markdown: true
    full_json: false

profiles:
  balanced:
    label: "Balanced"
    description: "Default balance between speed, temporal detail, and category breadth."
  fast:
    label: "Fast"
    description: "Higher throughput profile for broad folder sweeps and initial triage."
    analysis:
      hop_seconds: 1.0
      top_k_categories: 3
  precise:
    label: "Precise"
    description: "Finer temporal resolution and broader category search for detailed review."
    analysis:
      hop_seconds: 0.25
      min_confidence: 0.20
      top_k_categories: 7
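
To see what the window_seconds and hop_seconds values mean in practice, the sketch below lays out window start times for a clip. It is a simplified model rather than the logic in event_detector.py: it assumes a trailing partial window is dropped and that a clip shorter than one window is analyzed as a single window, and the function name is hypothetical.

```python
import math


def window_starts(duration: float, window_seconds: float, hop_seconds: float) -> list[float]:
    """Start times (in seconds) of full windows that fit inside the clip."""
    if window_seconds <= 0 or hop_seconds <= 0:
        raise ValueError("window_seconds and hop_seconds must be positive")
    if duration <= window_seconds:
        return [0.0]  # assumption: short clips get one window
    # Small epsilon guards against floating-point error in the division.
    count = math.floor((duration - window_seconds) / hop_seconds + 1e-9) + 1
    return [round(i * hop_seconds, 6) for i in range(count)]


# Hop values from the three profiles above, for a 10-second clip.
for name, hop in [("fast", 1.0), ("balanced", 0.5), ("precise", 0.25)]:
    starts = window_starts(10.0, window_seconds=2.0, hop_seconds=hop)
    print(f"{name}: {len(starts)} windows, last start at {starts[-1]}s")
```

Under these assumptions, halving the hop roughly doubles the number of CLAP forward passes per clip, which is the speed versus temporal detail trade-off the profiles expose.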

Profile overrides currently target CLAP inference behavior and related pipeline settings, including:

model.model_id
model.device
model.fp16
analysis.use_windowed_analysis
analysis.windowed_min_duration
analysis.window_seconds
analysis.hop_seconds
analysis.min_confidence
analysis.top_k_categories

To run a specific profile:

python timbre.py analyze samples/0_sample.wav --profile precise
python timbre.py batch ./samples --profile fast
python timbre.py validate --input ./out/precise/json --profile precise

To analyze and validate in one command:

python timbre.py analyze samples/0_sample.wav --profile precise --validate
python timbre.py batch ./samples --profile fast --validate
python timbre.py analyze samples/0_sample.wav --validate \
  --validate-backend openai --validate-model gpt-5.4-mini

`config/vocabulary.yaml`

Defines all labels CLAP classifies against. 13 categories, ~194 labels:

| Category | Example Labels |
|---|---|
| `impact` | metallic impact, glass shatter, gunshot, drum hit |
| `movement` | footsteps on gravel, door slam, paper rustling |
| `ambience` | outdoor ambience, crowd murmur, ocean waves |
| `weather` | heavy rain, thunder, wind howl |
| `machinery` | engine idle, electrical buzzing, drill |
| `vehicles` | car passing, motorcycle, aircraft flyover |
| `voices` | speech, laughter, crowd cheer |
| `water` | water dripping, waterfall, water splash |
| `animals` | bird chirping, dog bark, crickets |
| `textures` | low rumble, white noise, vinyl crackle |
| `music` | piano notes, guitar strum, rhythmic beat |
| `alerts` | alarm beep, siren, phone ringing |
| `background` | background noise, silence |

To add new labels: edit vocabulary.yaml and re-run. No retraining needed.
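
The relationship between categories and labels can be modelled as a plain mapping. The sketch below uses a three-category excerpt copied from the table above; the actual layout of vocabulary.yaml is not shown in this README, so the dict shape, the helper names, and the added label "wood crack" are all hypothetical.

```python
# Excerpt of the table above as a category -> labels mapping (assumed shape).
VOCAB_EXCERPT: dict[str, list[str]] = {
    "impact": ["metallic impact", "glass shatter", "gunshot", "drum hit"],
    "movement": ["footsteps on gravel", "door slam", "paper rustling"],
    "weather": ["heavy rain", "thunder", "wind howl"],
}


def flatten_labels(vocab: dict[str, list[str]]) -> list[tuple[str, str]]:
    """List every (category, label) pair the classifier would score against."""
    return [(category, label) for category, labels in vocab.items() for label in labels]


def add_label(vocab: dict[str, list[str]], category: str, label: str) -> dict[str, list[str]]:
    """Return a copy of the vocabulary with one new label; reject duplicates."""
    if label in {existing for _, existing in flatten_labels(vocab)}:
        raise ValueError(f"label already defined: {label!r}")
    updated = {cat: list(labels) for cat, labels in vocab.items()}
    updated.setdefault(category, []).append(label)
    return updated


def category_of(vocab: dict[str, list[str]], label: str) -> str:
    """Look up which category a label belongs to."""
    for category, existing in flatten_labels(vocab):
        if existing == label:
            return category
    raise KeyError(label)


extended = add_label(VOCAB_EXCERPT, "impact", "wood crack")
print(len(flatten_labels(VOCAB_EXCERPT)), len(flatten_labels(extended)))
print(category_of(extended, "wood crack"))
```

Since labels are only text prompts for zero-shot scoring, extending the mapping changes what can be reported without touching model weights.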

Example Outputs

| File | Description | Tags | Conf |
|---|---|---|---|
| `metal_clang.wav` | A sharp metallic impact followed by a short echo. | impact, metallic impact, sharp transient, reverb | 0.87 |
| `rain_heavy.wav` | Heavy rain on a hard surface, continuous and broadband. | weather, heavy rain, broadband noise, continuous | 0.91 |
| `footsteps_wood.wav` | Fast footsteps on a wooden floor, ending with a door slam. | movement, footsteps on wood, door slam, rhythmic | 0.82 |
| `engine_idle.wav` | A low-frequency mechanical engine hum, steady and continuous. | machinery, engine idle, mechanical hum, low frequency | 0.88 |
| `forest_wind.wav` | Soft wind ambience with distant birds and gentle rustling. | ambience, wind ambience, bird chirping, continuous | 0.79 |

Technical Notes

| | macOS Silicon (MPS) | RunPod (CUDA) | CPU fallback |
|---|---|---|---|
| Setup | `bash setup_mac.sh` | `bash setup_runpod.sh` | `pip install -r requirements.txt` |
| Device | MPS (auto-detected) | CUDA (auto-detected) | CPU |
| fp16 | No (fp32 only) | Yes | No |
| VRAM / RAM | ~4 GB unified memory | ~4 GB VRAM | system RAM |
| Speed (10s clip) | ~5–15s | ~2–3s | ~30–60s |
| torch wheels | Standard pip | CUDA-specific index | Standard pip |

Other notes:

pyproject.toml is the source of truth for Python dependencies; requirements.txt is release-generated from Poetry
CLAP model size: ~1.2 GB (downloaded from the Hugging Face Hub on first run, then cached)
torch >= 2.6.0 required (CVE-2025-32434 — torch.load safety fix)
On RunPod: torchvision must be upgraded together with torch in a single pip command to avoid internal import conflicts in transformers

Phase 2 Roadmap

| Phase | Focus | Key Addition |
|---|---|---|
| Phase 1 | Cataloging | CLAP + templates (this system) |
| Phase 2 | Richer descriptions | Audio LLM (Qwen-Audio, SALMONN) |
| Phase 3 | Search | CLAP embeddings → vector database |
| Phase 4 | Similarity | Nearest-neighbor audio retrieval |
| Phase 5 | Streaming | Real-time pipeline (WebSocket) |
